Instructions

MetAtlas analysis workflow

This notebook is for performing a targeted metabolomics analysis on your raw data files. In order for this to work, your data files must be uploaded to the NERSC raw data directory with appropriate permissions. A file converter runs at NERSC to convert files as follows: .raw > .mzml > .h5 > pactolus.gz. This conversion must happen before you can proceed with data analysis. A Northen Lab staff member can help you transfer and convert your files if they are not already at NERSC.


/global/project/projectdirs/metatlas/raw_data/SUBFOLDER


A targeted analysis requires a list of compounds to search for in your sample files; at a minimum, you must know each compound's m/z ratio to get started. The inputs for this notebook are the location of your raw files and an atlas listing these compounds.

Our lab has analyzed thousands of compounds using standardized LCMS methods. You may use one of these libraries to create an atlas for your sample analysis, provided you used one of the same standardized LCMS methods.

Analysis steps

  1. Create your atlas csv outside of Jupyter: typically you will either have a custom atlas or generate an atlas from one of our EMA libraries. These can be found in the shared Google Drive, under atlases.
  2. Run through the notebook with your initial atlas.
  3. Use the RT adjuster to mark the quality of the peaks and MSMS matches and to adjust your RT bounds.
  4. Reload your atlas from the "get atlas" block.
  5. Export your atlas, filter out the compounds marked "remove" under id notes, then save and upload your new atlas.
  6. Repeat the analysis using your filtered atlas.
  7. On your final export, make sure you exclude system blanks and QC files from your output. If you have extraction blanks, you should include those.

There is a database that stores your atlas and registered run files. During the analysis process you will pull these into local variables, which are used to extract the EIC and MSMS data stored in a variable called metatlas_dataset. All output files are generated from that variable. Because it is a local variable, it is not stored in the database; if your kernel dies partway through, you need to rerun the notebook to regenerate it.

IMPORTANT: Any time you adjust your RT bounds in the interactive plot, you need to retrieve your atlas again before exporting any files. The changes are stored in the database, but the exports are made from locally stored variables/dataframes.

Index

  1. Import python packages
  2. Create groups
  3. Select groups
  4. Create Atlases
  5. Select Atlas
  6. Annotate data
  7. Correct RT bounds
  8. Export results
  9. Tools

Helpful tips

  1. When entering search strings, use % as a wildcard: %peas will return "positive_peas" and "positive_greenpeas", but not "positive_greenpeas_HILIC"; for that, use %peas%. You can also combine multiple substrings, ex. %pos%peas% (see the sketch below).
  2. Set your kernel to mass_spec_cori. If you are not sure how, check with Ben or Daniel.
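
For a quick feel for the % matching, the standalone sketch below emulates it with Python's fnmatch (the actual matching happens in the metatlas database; the file names here are hypothetical):

In [ ]:
# Illustration of the % wildcard described in tip 1 (self-contained; hypothetical names).
import fnmatch

names = ['positive_peas', 'positive_greenpeas', 'positive_greenpeas_HILIC']
for pattern in ['%peas', '%peas%', '%pos%peas%']:
    glob = pattern.replace('%', '*')  # % behaves like the familiar * wildcard
    matches = [n for n in names if fnmatch.fnmatch(n, glob)]
    print('%s -> %s' % (pattern, matches))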

In [ ]:
# optional: run this block to adjust the width of the notebook.  Change the width percent.

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

1. Import Python Packages

Instructions:

  1. On the third line of the block below, add the directory where your most recent metatlas code is stored (to the sys.path.insert line).
  2. Only run this block once to load the metatlas python modules. If the kernel dies, you will need to rerun this block.
  3. Run the block with the "print fa.__file__" line to double-check that the printed output matches the directory in the sys.path.insert line of the prior block. If it does not match, click Restart on the "Kernel" dropdown and correct the path below before trying again.

In [2]:
%matplotlib notebook
import sys, os

# v edit this line
sys.path.insert(0,'/global/homes/FIRST-INITIAL-OF-USERNAME/USERNAME/REPOFOLDER/metatlas-master-20200416/metatlas-master/')
# ^ edit this line

from metatlas.helpers import fastanalysis as fa
from metatlas.helpers import dill2plots as dp
from metatlas.helpers import metatlas_get_data_helper_fun as ma_data
from metatlas.helpers import chromplotplus as cpp
import metatlas.metatlas_objects as metob
import qgrid
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
from IPython.display import display
import time
import pickle
import dill
import multiprocessing as mp
import pandas as pd
import matplotlib.pyplot as plot
import operator

pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 100)

In [3]:
# check that the first block registered the metatlas directory correctly.  If it does not look the same, seek help!
print fa.__file__


/global/common/software/m2650/python-cori/lib/python2.7/site-packages/metatlas/helpers/fastanalysis.pyc

Set atlas, project and output directories in your NERSC home directory

STEP 1

  1. Create a project folder name for this analysis by replacing the PROJECTDIRECTORY string text in red below. Make sure to update the rest of the directory to point to your home directory. The pwd block will print out the directory where this Jupyter notebook is stored.
  2. Create a subdirectory name for the output; on each run-through you may want to create a new output folder.
  3. When you run the block, the folders will be created in your home directory. If the directory already exists, the block will just set the path for use with future code blocks.

STEP 2

  1. Enter the NERSC path where you have stored your atlas files using one of the two optional lines below:
    1. Set a custom path
    2. Use the project and output subdirectory from above

In [5]:
pwd


Out[5]:
u'/global/u2/s/smkosina/Workflow notebooks'

In [5]:
#STEP 1
project_directory='/global/homes/FIRST-INITIAL-OF-USERNAME/USERNAME/PROJECTDIRECTORY/'  # <- edit this line, do not copy the path directly from NERSC (ex. the u1, or u2 directories)
output_subfolder='HILIC_POS_20190830/'  # <- edit this as 'chromatography_polarity_yyyymmdd/'
output_dir = os.path.join(project_directory,output_subfolder)

if not os.path.exists(project_directory):
    os.makedirs(project_directory)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

#STEP 2
pathtoatlas='/global/homes/FIRST-INITIAL-OF-USERNAME/USERNAME/DIRECTORYOFATLAS/'    # <- enter the directory where you have stored your atlas
#pathtoatlas= '%s%s' % (project_directory,output_subfolder)

2. Create Groups (named variables that hold your replicates of each sample)

You must assign your raw files to experimental groups for analysis. These are used for downstream statistics and for selecting specific groups when filtering to subsets of files for analysis (ex. just pos or just neg).

The groups are created from the common file header plus the unique group names. The filename convention our lab uses is as follows:


DATE_NORTHENLABINITIALS_COLLABINITIALS_PROJ_EXP_SAMPSET_SYSTEM_COLUMN-method_SERIAL_POL_ACQ_SAMPLENUMBER_SAMPLEGROUP_REP_OPTIONAL_SEQ

Ex.: 20180105_SK_AD_ENIGMA_PseudoInt_R2ADec2017_QE119_50454_123456_POS_MSMS_001_Psyringae-R2A-30C-20hr_Rep01_NA_Seq001.raw


The common header consists of the fields 0-10: DATE_NORTHENLABINITIALS_COLLABINITIALS_PROJ_EXP_SAMPSET_SYSTEM_COLUMN-method_SERIAL_POL_ACQ

The sample group name is commonly field #12 (between underscores 11 and 12, 0-indexed).
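
To see how the fields map, here is a minimal sketch that splits the example filename above (plain Python, nothing metatlas-specific):

In [ ]:
# Minimal sketch: pull the common header (fields 0-10) and the sample group (field 12)
# out of the example filename above.
fname = '20180105_SK_AD_ENIGMA_PseudoInt_R2ADec2017_QE119_50454_123456_POS_MSMS_001_Psyringae-R2A-30C-20hr_Rep01_NA_Seq001.raw'
fields = fname.split('.')[0].split('_')
common_header = '_'.join(fields[:11])  # DATE through ACQ
sample_group = fields[12]              # SAMPLEGROUP
print('%s | %s' % (common_header, sample_group))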

Find your files

  1. On the first line of the block below, set the 'experiment' and 'name' variables to find your files. These fields require wildcards (%) for partial string searches.
  2. 'experiment' is the folder name within /global/project/projectdirs/metatlas/raw_data that will be emailed to you when the files are uploaded to NERSC. You can also look in the raw_data directory for the NERSC user who uploaded your files; your experiment folder should be in there.
  3. 'name' is a string that will match a subset of your files within that folder.

In [ ]:
files = dp.get_metatlas_files(experiment = '%ENTERSTRING%',name = '%ENTERSTRING%',most_recent = True)
# ^ edit the text string in experiment and name fields

df = metob.to_dataframe(files)
my_grid = qgrid.QGridWidget(df=df[['experiment','name','username','acquisition_time']])
my_grid.export()

In [ ]:
len(files)

OPTION A: Automated Group Maker

This will attempt to create groups in an automated fashion (rather than filling out a spreadsheet with a list of files and group names). If your files are all in one folder at NERSC, you can use this option. If not, use OPTION B below.

A long group name, consisting of the common header plus either a controlled-vocab value or field #12, along with a short group name (just the controlled-vocab value or field #12), will be stored in a local variable. The short group names can be used on plots.

  1. STEP 1: View the groups
    1. Pick an experiment folder to look for files in, in the metob.retrieve call.
    2. Enter a controlled vocabulary for control files, to group select files together when the control string may be in a different field (not #12) or a randomly placed substring within a field (ex. if 'InjBl' is included in your controlled vocab list, files like InjBl-MeOH and StartInjBl will group together).
    3. If your group name is not between underscores 11 and 12, you can adjust those values in the split commands below. All other (non-controlled-vocab) groups will be created from that field.
  2. STEP 2: Create the groups variable after checking the output from STEP 1.
  3. STEP 3:
    Option A: If the group names and short names look fine, store the groups (once you know you have files in the correct groups) by running the Option A block after checking the output of STEPS 1 and 2.
    Option B (optional): If you would like to edit the groups, uncomment Options B-I and B-II. Run Option B-I to export a prefilled tab infosheet. Edit the file, then run Option B-II to import the new groups and save them.

In [ ]:
#STEP 1: View the groups

files = metob.retrieve('lcmsruns',experiment='%ENTERSTRING%',username='*')
controlled_vocab = ['QC','InjBl','ISTD'] # entries may start with '_'; a leading '_' is stripped from the short name
version_identifier = 'vrs3'
file_dict = {}
groups_dict = {}
for f in files:
    k = f.name.split('.')[0]
    # get the index of any controlled vocab term contained in the filename
    indices = [i for i, s in enumerate(controlled_vocab) if s.lower() in k.lower()]
    prefix = '_'.join(k.split('_')[:11])
    if len(indices)>0:
        short_name = controlled_vocab[indices[0]].lstrip('_')
        group_name = '%s_%s_%s'%(prefix,version_identifier,short_name)
        short_name = k.split('_')[9]+'_'+short_name # Prepending POL to short_name
    else:
        short_name = k.split('_')[12]
        group_name = '%s_%s_%s'%(prefix,version_identifier,short_name)
        short_name = k.split('_')[9]+'_'+k.split('_')[12]  # Prepending POL to short_name
    file_dict[k] = {'file':f,'group':group_name,'short_name':short_name}
    groups_dict[group_name] = {'items':[],'name':group_name,'short_name':short_name}
df = pd.DataFrame(file_dict).T
df.index.name = 'filename'
df.reset_index(inplace=True)
df.drop(columns=['file'],inplace=True)
for ug in groups_dict.keys():
    for file_key,file_value in file_dict.items():
        if file_value['group'] == ug:
            groups_dict[ug]['items'].append(file_value['file'])
df.head(100)

In [ ]:
#STEP 2: create the groups variable, if the above looks OK

groups = []
for group_key,group_values in groups_dict.items():
    g = metob.Group(name=group_key,items=group_values['items'],short_name=group_values['short_name'])
    groups.append(g)        
    for item in g.items:
        print(g.name,g.short_name,item.name)
    print('')

In [ ]:
# STEP 3 Option A: store the groups variable content in the DB (currently only the long group name is stored)
metob.store(groups)

In [ ]:
## STEP 3 Option B-I: OPTIONAL: Export groups to csv file for editing (filename, short_name, group, description)
#dp.make_prefilled_fileinfo_sheet(groups,os.path.join(output_dir,'prefilled_fileinfo.tab'))

In [ ]:
## STEP 3 Option B-II: Import groups from csv file after editing the prefilled_fileinfo.tab
#groups = dp.make_groups_from_fileinfo_sheet(os.path.join(output_dir,'prefilled_fileinfo.tab'), filetype='tab', store=True)

OPTION B: Register LCMS Runs into categorical groups from a file.

Typically, you will make one fileinfo sheet with all of your files (pos and neg) for this experiment. At a minimum, group names MUST contain the first 11 underscore-delimited fields (DATE_NORTHENLABINITIALS_COLLABINITIALS_PROJ_EXP_SAMPSET_SYSTEM_COLUMN-method_SERIAL_POL_ACQ) and the SAMPLEGROUP field.

Files can be from multiple folders at NERSC.

  1. STEP 1: Select files
    1. Edit the experiment and name fields to find the files you want.
  2. STEP 2: Create and save a .tab file to your project directory.
    1. After running the block, find the .tab file in your project directory.
    2. Open it in Excel or another spreadsheet editor.
    3. Fill out the group names as described above.
    4. Save the file as filled_fileinfo.txt.
  3. STEP 3: Create groups from the spreadsheet.
    1. Transfer the .txt file back to your project directory.
    2. Run the make-groups block using store=False.
  4. STEP 4: Check groups.
    1. Run the next block, metob.to_dataframe(g), and check that the information looks correct.
  5. If it is correct, rerun the STEP 3 make-groups block using store=True. If not, fix your file in Excel and redo STEPS 2 and 3.

In [ ]:
#STEP 1: Select files
files = dp.get_metatlas_files(experiment = '%ENTERSTRING%',name = '%ENTERSTRING%',most_recent = True)
# ^ edit the text string in experiment and name fields

In [ ]:
#STEP 2: Save spreadsheet file
dp.make_empty_fileinfo_sheet(os.path.join(output_dir,'empty_fileinfo.tab'),files)

In [ ]:
#STEP 3: create groups from file
g = dp.make_groups_from_fileinfo_sheet(os.path.join(output_dir,'filled_fileinfo.txt'),
                                       filetype='tab',
                                       store=False)  # per STEP 3 above, run with store=False first; rerun with store=True after checking STEP 4

In [ ]:
# STEP 4: check groups
metob.to_dataframe(g)

3. Select groups of files to operate on

Here, you will assign your database groups to a local variable which will be used downstream in the notebook for analyzing your data with an atlas.

  1. In the block below, fill out the fields for name, include_list and exclude_list using text strings from the group names you created in the previous step. The include/exclude lists do not need wildcards. name is a string common to all of your groups (ex. fields 0-11 of your filenames).

Typically, you will run one polarity at a time.


In [ ]:
groups = dp.select_groups_for_analysis(name = '%ENTERSEARCHSTRING%',    # <- edit text search string here
                                       most_recent = True,
                                       remove_empty = True,
                                       include_list = [], exclude_list = ['NEG','QC','InjBl'])  # ex. exclude_list = ['QC','Blank']
print("sorted groups")
groups = sorted(groups, key=operator.attrgetter('name'))
for i,a in enumerate(groups):
    print(i, a.name)

In [ ]:
# to view metadata about your groups, run the block below
metob.to_dataframe([groups])

4. Create new Atlas entries in the Metatlas database from a csv file

QC, IS, and EMA template atlases are available on the google drive.

  1. Create your atlas as a csv file and check that it looks correct (it has all the correct headers, no blank values in rows, and all columns are the correct data type).
  2. Save it with the type of atlas (EMA, QC or IS), your initials, the experiment name, the polarity, and the version or timestamp.
  3. Upload it to your NERSC project directory (the one you named above). (If it doesn't work, double-check that your file permissions are set to at least rw-rw----.)
  4. Run the blocks below to create the DB entries for negative and positive mode atlases.
  5. WARNING: Don't run this block over and over again - it will create multiple new DB entries with the same atlas name.

Required Atlas file headers:

inchi_key,label,rt_min,rt_max,rt_peak,mz,mz_tolerance,adduct,polarity,identification_notes

Values in rows must be completed for all fields except inchi_key (leaving it blank will prevent MSMS matching below) and identification_notes; see the example row below.

INFO: store=True will register your atlas in the database. If you are not sure whether your atlas structure is correct, set store=False the first time you run the block to check for errors. If there is no error, rerun it with store=True.
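
For reference, a minimal example of the expected layout (a single compound; the values and adduct here are illustrative, not a template to copy):

inchi_key,label,rt_min,rt_max,rt_peak,mz,mz_tolerance,adduct,polarity,identification_notes
OIRDTQYFTABQOQ-KQYNXXCUSA-N,adenosine,2.4,3.1,2.7,268.1040,20,[M+H]+,positive,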

NEGATIVE MODE ATLAS UPLOAD


In [ ]:
atlasfilename='EMA_SK_20190703_UVMoss_Chamber0206_SolUVAC_QE119_HIL_NEG_20190829_1735'  # <- enter the exact name of your csv file without the file extension

names = dp.make_atlas_from_spreadsheet('%s%s%s' % (pathtoatlas,atlasfilename,'.csv'),  # <- DO NOT EDIT THIS LINE
                                       atlasfilename,
                                       filetype='csv',
                                       sheetname='',
                                       polarity = 'negative',
                                       store=True
                                       #,mz_tolerance = 20   #Uncomment this line if you want to override the mz_tolerance values in your atlas, otherwise, the code will use your values in the csv sheet
                                      )

POSITIVE MODE ATLAS UPLOAD


In [ ]:
atlasfilename='EMA_SK_20190703_UVMoss_Chamber0206_SolUVAC_QE119_HIL_POS_20190829_1735'   # <- enter the exact name of your csv file without the file extension

names = dp.make_atlas_from_spreadsheet('%s%s%s' % (pathtoatlas,atlasfilename,'.csv'),  # <- DO NOT EDIT THIS LINE
                                       atlasfilename,
                                       filetype='csv',
                                       sheetname='',
                                       polarity = 'positive',
                                       store=True
                                       #,mz_tolerance = 20   #Uncomment this line if you want to override the mz_tolerance values in your atlas, otherwise, the code will use your values in the csv sheet
                                      )

5. Select Atlas to use

  1. The first block will retrieve a list of atlases matching the 'name' string that you enter. You must also enter your username.
  2. The next block will select one atlas from the list by its index number. Make sure to enter the index number of the atlas you want to use for your analysis in this line: my_atlas = atlases[0]

In [ ]:
atlases = metob.retrieve('Atlas',name='%UVMoss_Chamber0206_SolUVAC%POS%',username='YOUR-NERSC-USERNAME')
names = []
for i,a in enumerate(atlases):
    print(i,a.name,pd.to_datetime(a.last_modified,unit='s'))#len(a.compound_identifications)

In [ ]:
my_atlas = atlases[0]
atlas_df = ma_data.make_atlas_df(my_atlas)
atlas_df['label'] = [cid.name for cid in my_atlas.compound_identifications]
print my_atlas.name
metob.to_dataframe([my_atlas])
# the first line of the output will show the dimensions of the atlas dataframe

In [ ]:
# OPTIONAL: to view your atlas, run this block
print my_atlas.name
atlas_df

6. Get EICs and MSMS for all files in your groups, using all compounds in your atlas.

This block builds the metatlas_dataset variable, which holds your EIC data (mz, RT and intensity values within your mz and RT ranges).

The EIC data contains mz, intensity and RT values across your RT range. There are two parameters that you may need to edit: extra_time and extra_mz. extra_time collects mz, intensity and RT values from outside your atlas-defined rt_min and rt_max values. For example, if rt_min = 1.0, rt_max = 2.0 and extra_time = 0.3, your new RT range is 0.7 to 2.3. This is helpful for checking whether you have nearby peaks at the same m/z. extra_mz should only be used for troubleshooting; keep it at 0 unless you believe you had poor mass accuracy during your run. Another way to address that issue is to change the mz_tolerance values in your atlas. Before changing this value, check in with a metatlas-experienced lab member to discuss when and how to use it.

  1. Change the value in "extra_time = 0.0" to something like 0.5 to 1.0 for the first EMA run-through on your files. This takes longer but collects MSMS outside your retention windows, which lets you check the MSMS of nearby peaks before adjusting your RT bounds around the correct peak.
  2. extra_mz should almost always be set to 0.0. If you need to troubleshoot a low-mz compound, you could potentially run it back through with a larger mz error window than what was specified in your atlas (ppm tolerance).

On your final run-through, set extra_time back to 0.


In [ ]:
extra_time = 0.1   # see notes above: use 0.5-1.0 for the first EMA run-through, 0 on the final one
extra_mz = 0.00
all_files = []
for my_group in groups:
    for my_file in my_group.items:
        all_files.append((my_file,my_group,atlas_df,my_atlas,extra_time,extra_mz))
pool = mp.Pool(processes=min(4, len(all_files)))
t0 = time.time()
metatlas_dataset = pool.map(ma_data.get_data_for_atlas_df_and_file, all_files)
pool.close()
pool.terminate()
print time.time() - t0
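
If you want to sanity-check what was collected, the optional block below assumes the nested structure returned by ma_data.get_data_for_atlas_df_and_file (a list indexed by file, then by compound, with 'lcmsrun', 'identification' and 'data' keys); adjust the keys if your metatlas version differs.

In [ ]:
# Optional sanity check of metatlas_dataset (assumed structure: [file][compound] dicts;
# keys taken from this metatlas version, adjust if they differ).
entry = metatlas_dataset[0][0]
print(entry['lcmsrun'].name)                    # source file for this entry
print(entry['identification'].name)             # atlas compound label
print(len(entry['data']['eic']['intensity']))   # number of EIC points collected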

In [ ]:
# Make data sources tables (atlas_metadata.tab, groups_metadata.tab, groups.tab and [atlasname]_originalatlas.tab within data_sources subfolder)
ma_data.make_data_sources_tables(groups, my_atlas, output_dir)

One of the two blocks below builds the hits variable. This holds your MSMS spectra (from within your mz and RT ranges, plus the extra time indicated above).

There are two options for generating the hits variable:

  1. Block A: use when your files have MSMS. It creates the hits variable and also saves a binary (pickled) hits file to the output directory.
  2. Block B: only run if your files were collected in MS1 mode.
  3. If you have already run Block A and the kernel later dies, you can skip Block A and directly unpickle the binary hits file from the output directory: skip Block A, uncomment the Optional block, and run it.

In [ ]:
##BLOCK A
t0 = time.time()

hits=dp.get_msms_hits(metatlas_dataset,extra_time=True,keep_nonmatches=True)
pickle.dump(hits, open(os.path.join(output_dir,'hits.pkl'), "wb"))

print time.time() - t0
print '%s%s' % (len(hits),' <- total number of MSMS spectra found in your files')

In [ ]:
## BLOCK B (uncomment lines below to run this.  Only use when all data files are MS1)
#hits=pd.DataFrame([], columns=['database','id','file_name','msms_scan','score','num_matches','msv_query_aligned','msv_ref_aligned','name','adduct','inchi_key','precursor_mz','measured_precursor_mz'])
#hits.set_index(['database','id','file_name','msms_scan'], inplace=True)

In [ ]:
# Optional: If you already have a pickled hits file and do not need to run get_msms_hits again, uncomment this block
# hits = pickle.load(open(os.path.join(output_dir,'hits.pkl'), "rb"))

7. Adjust Retention Times.

This block creates an interactive plot. The top panel displays MSMS from within the two green RT bounds selected below (rt min and max, initially set from the atlas). When the database holds reference spectra, mirror plots are generated with the reference spectrum inverted below the sample spectrum. The lower panel displays the EICs overlaid for all of the files in your selected groups. You can highlight your groups in different colors; it is recommended that you do this, at least, for your extraction blank (or, if one is not available, a solvent injection blank). The plot also displays radio buttons that can be selected interactively; their values will be exported in your final identifications table and in your atlas export. Use these to mark peak/MSMS quality.

How to use:

  1. STEP 1: Set peak flag radio buttons
    1. OPTION A (custom flags): fill out peak_flag_list as a list of strings, ex. peak_flag_list = ('A','B'); some recommendations are in the code block below.
    2. OPTION B (default flags): comment out the custom peak_flag_list line and uncomment the default peak_flag_list = "". Flags then default to: keep, remove, unresolvable isomers, poor peak shape.
  2. STEP 2: Set EIC colors
    1. OPTION A (custom EIC colors): fill out the colorlist in the format below:

      colorlist = [['color1nameorhexadec','partialgroupstring1'], ['color2nameorhexadec','partialgroupstring2']]

      • You can add more comma-delimited colors/groups as needed.
      • The group strings are partial strings that match your file names (not your group names).
      • The order they are listed is the order they are displayed in the overlays (first is front, last is back).
      • Named matplotlib colors are listed at https://matplotlib.org/3.1.0/gallery/color/named_colors.html, or use hexadecimal values such as '#000000'.
    2. OPTION B (default EIC colors): comment out the custom colorlist lines and uncomment the default colorlist = "". All lines default to black.
  3. Use the right/left arrow keys on your keyboard to cycle through the compounds in your atlas.
  4. Use the up/down arrow keys on your keyboard to cycle through the MSMS spectra within the RT bounds of the lower plot.
  5. Use the horizontal rt min and rt max bars below the plots to adjust the RT bounds around your peak. If there are multiple peaks, select one at a time, then press up/down to update the MSMS available in the new RT range. If necessary, evaluate your data in an external program such as MZmine to make sure you are selecting the correct peak.

TIPS: use compound_idx = 0 in STEP 3 to jump to a different compound in your atlas by its index number. If your plot does not fit in your browser window, adjust the height and width values. Use alpha to change the transparency of the lines; it is a value from 0 (transparent) to 1 (opaque).

DO NOT change your RT theoretical peak (the purple line). It is locked from editing (unless you change a hidden parameter) and should only be changed in special cases. The measured retention times of your peaks will be calculated and exported in your output files; these will be compared with the theoretical RTs and used in your evidence of identification table.


In [ ]:
###STEP 1: Set the peak flag radio buttons using one of the two lines below, for custom flags or default flags

peak_flag_list=('','L1+ - 1 pk, good RT&MSMS','L1+ - known isomer overlap','L1+ - 1 pk, good RT, MSMS ok (coisolated mz/partial match/low int)','L1+ - 1 pk, good RT&MSMS from external library','L1 - 1 pk, correct RT, no MSMS or int too low for matching','L1 - 1 pk, good RT, very low intensity/poor pk shape','L2 put comp','L3 putative class','Remove - background/noise','Remove - bad EMA MSMS','Remove - bad MSMS NIST/MONA/Metlin')
#peak_flag_list =""       # this will default to ('keep','remove','unresolvable isomers','poor peak shape')

###STEP 2: Set the EIC line colors using one of the two lines below, for custom colors or default
colorlist= [['red','peas'],                                                     
           ['green','beets']]
#colorlist=""   # this will default to black

###STEP 3
a = dp.adjust_rt_for_selected_compound(metatlas_dataset, msms_hits=hits,peak_flags=peak_flag_list, color_me = colorlist, compound_idx=0,alpha=0.5,width=15,height=4.5)

8. Export results files

Export Atlas to a Spreadsheet

The peak flags that you set and selected from the RT adjuster radio buttons will be saved in a column called id_notes.


In [ ]:
atlas_identifications = dp.export_atlas_to_spreadsheet(my_atlas,os.path.join(project_directory,output_subfolder,'%s%s.csv' % (my_atlas.name,"export")))
print my_atlas.name

Export MSMS match scores, stats sheets, and final identification table

This block creates a number of files:

  1. compound_scores.csv
  2. stats_table.tab
  3. filtered and unfiltered peak heights, areas, msms scores, mz centroids, mz ppm errors, numbers of fragment matches, rt deltas, and rt peaks
  4. a final identification sheet that is formatted for use as a supplemental table for manuscript submission. You will need to manually complete some columns. Please discuss with Ben, Katherine, Daniel or Suzie before using it for the first time.

The kwargs below set the filtering thresholds for the parameters indicated.


In [ ]:
kwargs = {'min_intensity': 1e4,   # strict = 1e5, loose = 1e3
          'rt_tolerance': .5,    #>= shift of median RT across all files for given compound to reference
          'mz_tolerance': 20,      # strict = 5, loose = 25; >= ppm of median mz across all files for given compound relative to reference
          'min_msms_score': .6, 'allow_no_msms': True,     # strict = 0.6, loose = 0.3 <= highest compound dot-product score across all files for given compound relative to reference
          'min_num_frag_matches': 1, 'min_relative_frag_intensity': .001}   # strict = 3 and 0.1, loose = 1, 0.01 number of matching mzs when calculating max_msms_score and ratio of second highest to first highest intensity of matching sample mzs
scores_df = fa.make_scores_df(metatlas_dataset,hits)
scores_df['passing'] = fa.test_scores_df(scores_df, **kwargs)

pass_atlas_df, fail_atlas_df, pass_dataset, fail_dataset = fa.filter_atlas_and_dataset(scores_df, atlas_df, metatlas_dataset, column='passing')

fa.make_stats_table(input_dataset = metatlas_dataset, msms_hits = hits, output_loc = output_dir,min_peak_height=1e5,use_labels=True,min_msms_score=0.01,min_num_frag_matches=1,include_lcmsruns = [],exclude_lcmsruns = ['QC'])
scores_df.to_csv(os.path.join(output_dir, 'stats_tables','compound_scores.csv'))

Export EIC chromatograms as individual pdfs for each compound

  1. There are three options for formatting your EIC output using the "group =" line below:
    1. 'page' will print each sample group on a new page of a pdf file
    2. 'index' will label each group with a letter
    3. None will print all of the groups on one page with very small subplot labels
  2. The Y axis scale can be shared across all files using share_y = True or set to the max within each file using share_y = False

In [ ]:
group = 'index' # 'page' or 'index' or None
save = True
share_y = True

dp.make_chromatograms(input_dataset=metatlas_dataset, group=group, share_y=share_y, save=save, output_loc=output_dir)

Export MSMS mirror plots as individual pdfs for each compound

  1. use_labels = True will use the compound names you provided in your atlas; if you set it to False, the compounds will be named with the first synonym available from PubChem, which could be a common name, IUPAC name, CAS number, vendor part number, etc.
  2. The include and exclude lists will match partial strings in filenames; do not use wildcards.

In [ ]:
dp.make_identification_figure_v2(input_dataset = metatlas_dataset, msms_hits=hits, use_labels=True, include_lcmsruns = [],exclude_lcmsruns = ['InjBl','QC','Blank','blank'], output_loc=os.path.join(output_dir,'msms_mirror_plots'))

Error bar plots

This block first exports data sheets (peak height, peak area, mz peak, rt peak, mz centroid and rt centroid) to the data_sheets subfolder, then generates error bar plots across your groups for peak height and rt peak.


In [ ]:
peak_height = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='peak_height', output_loc=os.path.join(output_dir,'data_sheets'))
peak_area = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='peak_area', output_loc=os.path.join(output_dir,'data_sheets'))
mz_peak = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='mz_peak', output_loc=os.path.join(output_dir,'data_sheets'))
rt_peak = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [],fieldname='rt_peak', output_loc=os.path.join(output_dir,'data_sheets'))
mz_centroid = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='mz_centroid', output_loc=os.path.join(output_dir,'data_sheets'))
rt_centroid = dp.make_output_dataframe(input_dataset = metatlas_dataset,include_lcmsruns = [],exclude_lcmsruns = [], fieldname='rt_centroid', output_loc=os.path.join(output_dir,'data_sheets'))

dp.plot_errorbar_plots(peak_height, output_loc=os.path.join(output_dir,'error_bar_peak_height'))
dp.plot_errorbar_plots(rt_peak, output_loc=os.path.join(output_dir,'error_bar_rt_peak'))